120 research outputs found

    Manuscripts and machines: the automatic replacement of spelling variants in a Portuguese historical corpus

    The CARDS-FLY project aims to collect and transcribe, in a digital format, a diverse sample of historical personal letters from the 16th to the 20th century, creating a linguistic resource for the historical study of the Portuguese language and society. The letters were written by people from all social layers, and their historical, social and pragmatic contexts are documented in the digital format. Here we study one particular aspect of this collection, namely spelling variation. On the basis of this analysis, we improved a statistical spelling normalisation tool that we aim to use to automatically normalise the spelling in the full collection of digitised letters.

    Modality in Text: a Proposal for Corpus Annotation

    We present an annotation scheme for modality in Portuguese. In our annotation scheme we have tried to combine a more theoretical linguistic viewpoint with a practical annotation scheme that will also be useful for NLP research but is not geared towards one specific application. Our notion of modality focuses on the attitude and opinion of the speaker or of the subject of the sentence. We validated the annotation scheme on a corpus sample of approximately 2000 sentences that we fully annotated with modal information, using the MMAX2 annotation tool to produce XML annotation. We discuss our main findings and pay attention to the difficult cases that we encountered, as they illustrate the complexity of modality and its interactions with other elements in the text.

    A large Portuguese corpus on-line: cleaning and preprocessing

    We present a newly available on-line resource for Portuguese: a corpus of 310 million words, a new version of the Reference Corpus of Contemporary Portuguese, now searchable via a user-friendly web interface. Here we report on work carried out on the corpus prior to its publication on-line. We focus on the processes and tools involved in the cleaning, preparation and annotation that make the corpus suitable for linguistic inquiries.

    Proposal for Multi-word Expression annotation in running text

    We present a proposal for the annotation of multi-word expressions (MWEs) in a 1M-word corpus of contemporary Portuguese. Our aim is to create a resource that allows us to study MWEs in their context. The corpus will be a valuable additional resource next to the already existing MWE lexicon, which was based on a much larger corpus of 50M words. In this paper we discuss the problematic cases for annotation and the proposed solutions, focusing on the variational properties of MWEs.

    Towards a unified approach to modality annotation in Portuguese

    This paper introduces the first efforts towards a common ground for modality annotation for Portuguese. We take into account two existing schemes, for European and Brazilian Portuguese, already applied to written texts and to spontaneous speech data, respectively. We compare the two schemes, discuss their strengths and weaknesses, and then introduce our unifying proposal, pointing out the issues that seem to be already settled and the points that should be considered when the scheme starts to be implemented.

    A Corpus of Santome

    We present the process of constructing a corpus of spoken and written material for Santome, a Portuguese-related creole language spoken on the island of S. Tomé in the Gulf of Guinea (Africa). Since the language lacks an official status, we faced the typical difficulties, such as language variation, lack of a standard spelling, lack of basic language instruments, and only a limited data set. The corpus comprises data from the second half of the 19th century until the present. For the corpus compilation we followed corpus linguistics standards and used UTF-8 character encoding and XML to encode meta-information. We discuss how we normalized all material to one spelling, how we dealt with cases of language variation, and what type of metadata is used. We also present a POS tag set developed for the Santome language that will be used to annotate the data with linguistic information.

    Avanços nas humanidades digitais (Advances in the Digital Humanities)

    This chapter traces the advances in Portuguese philology since the digital environment began to emerge as the most appropriate context for the circulation of knowledge. It goes back to the first experiments in the mechanical processing of Portuguese texts, when two major advantages of computer assistance for the historical study of the language were already apparent: preventing human error in transcriptions and editions, and preventing the abandonment of tasks too vast for human capacity. It then follows a later phase in which scholars internationally stopped merely using the digital as an instrument and began to work in harmony with it, trying to understand how many concepts and methods need to be revolutionised so that philology can continue to fulfil its responsibility as the discipline concerned with the expert examination of texts and their dialogue with the history of culture and the history of the language. It analyses those models of scholarly editing that, because their encoding is explicit and consistent, meet the requirement of machine readability while also allowing, thanks to the markup language and rich annotation they adopt, increasing manipulation of their computational representations. Finally, it shows how Portuguese philology has gained an accelerated pace of experimentation in this area.

    Complex Predicates annotation in a corpus of Portuguese

    We present a scheme for the annotation of complex predicates (CPs), understood as constructions with more than one lexical unit, each contributing part of the information normally associated with a single predicate. We discuss our annotation guidelines for four types of complex predicates and the treatment of several difficult cases related to ambiguity, overlap and coordination. We then discuss the process of marking up the Portuguese CINTIL corpus of 1M tokens (written and spoken) with a new layer of information regarding complex predicates. We also present the outcomes of the annotation work and statistics on the types of CPs that we found in the corpus.